METHOD FOR CONTENT-BASED FILTERING OF MESSAGES BY ANALYZING TERM CHARACTERISTICS WITHIN
A MESSAGE

BACKGROUND OF THE INVENTION 
In this patent, the term "junk messages" is used
to refer to both junk e-mail messages and junk newsgroup
messages.

Junk messages represent a major and growing
problem for the Internet and World Wide Web. Junk messages
include many types of messages that the recipient does not
wish to read, including messages containing unsolicited
commercial advertisements, chain letters, scams and frauds,
such as multi-level marketing schemes and get-rich-quick
schemes, advertisements for adult services and spam. (Spam
is a vernacular term for messages that are posted to an
excessive number of newsgroups.)
Junk messages are harmful because they shift the
burden of determining importance from sender to recipient,
externalizing the true costs of the junk. The sender has no
direct incentive to consider the wishes of the recipient.

Junk messages waste the recipient's time and
money. It takes time to download, identify and discard the
junk messages. The junk also buries important messages,
causing a loss of productivity. If the recipient pays for
connect time and telephone calls, the junk messages cost the
recipient money, akin to postage due advertisements. On
flat-rate dial-up services, the service provider pays for
the junk messages in terms of wasted bandwidth and disk
space. These costs are ultimately passed on to the
recipient. The problem will continue to grow as more people
become connected to the Internet.
Most current methods for filtering out junk
messages use the headers of the message to identify the junk
mail. These programs maintain extensive blacklists of the
e-mail addresses, domain names and IP addresses of sources
of junk messages and remove any messages from those sources.
They may also filter based on other header fields (e.g.,
peculiarities in the recipient address) or the telltale
signs of forged message headers. A comparison of two of the
largest blacklists with a large corpus of junk messages
found that this method identifies only about 70% of the junk
messages.

Another popular method is to filter messages which
were transmitted via blind carbon copy or a mailing list.
Such messages can be easily identified because the
recipient's address does not appear in the recipient fields
of the header, but then the recipient must maintain a
whitelist of legitimate sources of mail, such as his or her
mailing list subscriptions and the e-mail addresses of
colleagues who might send a message via blind carbon copy,
to avoid filtering out legitimate messages. This heuristic
would have caught only about 50% of the junk messages in our
corpus.

To summarize, a blacklist is a list of header
specifiers used to block messages and a whitelist is a list
of header specifiers used to allow messages which would
otherwise be filtered out to pass through the blockade.

Unfortunately, blacklists have many problems.
They must be constantly updated as the large-scale offenders
frequently change domain names and forge return addresses.
Many junk messages come from first-time offenders and hence
cannot be detected using a blacklist. The offender can also
address the messages individually with randomly selected
forged return addresses. Header-based methods also cannot
detect messages transmitted via a mailing list to which the
recipient subscribes, nor junk messages posted to
newsgroups. The provider of a blacklist faces the
possibility of litigation for defamation and restraint of
trade, especially if legitimate users and domains are
accidentally or intentionally included in the blacklist.

DESCRIPTION OF THE PRIOR ART
W. Tietz, Electronic delivery of unwanted messages
in open communications systems, NTZ (Germany), 47(2):74-7,
February 1994.

Cynthia Dwork and Moni Naor, Pricing via
processing or combating junk mail, Weizmann Institute of
Science, Department of Applied Mathematics and Computer
Science, Technical Report CS95-20, 1995.

Douglas W. Oard and Gary Marchionini, A Conceptual
Framework for Text Filtering, University of Maryland at
College Park, Technical Report CS-TR-3643, May 1996.

Jason Rennie, ifile mail filtering system,
http://www.cs.cmu.edu/~jr6b/ifile/ifile.

U.S. Patent No. 5,619,648 entitled "Message
Filtering Techniques", Lucent Technologies Inc., filed
November 30, 1994, issued April 8, 1997.

U.S. Patent No. 5,283,856 entitled "Event-Driven
Rule-Based Messaging System", Beyond Inc., filed October 4,
1991, issued February 1, 1994. See also related U.S. Patent
No. 5,555,346.

U.S. Patent No. 5,627,764 entitled "Automatic
Electronic Messaging System With Feedback and Work Flow
Administration", Banyan Systems, Inc., filed June 9, 1993,
issued May 6, 1997.

U.S. Patent No. 5,377,354 entitled "Method and
System for Sorting and Prioritizing Electronic Mail
Messages", Digital Equipment Corporation, filed June 8,
1993, issued December 27, 1994.

There are numerous patents dealing with variations
on the TFIDF method, including U.S. Patents Nos. 5,576,954;
5,659,766; 5,687,364; 5,371,807; and 5,675,819. The TFIDF
method computes the ratio of the frequency of each term in a
document (TF) to the percentage of documents in which the
term appears (the document frequency, whose inverse is IDF).
IDF stands for inverse document frequency.

TFIDF uses IDF to emphasize terms which occur
frequently in the document but relatively rarely in the
collection of documents. In contrast, the TDTF method
disclosed herein tries to emphasize terms which occur
frequently in the message and which are good indicators of
junk messages (i.e., terms that occur frequently in junk
messages and rarely in non-junk messages). TD ("term
discriminability") provides a good indicator of junk
messages by measuring the precision of the terms for the
specific purpose of classifying junk messages. TDTF computes
the product of the frequency of each term in the document
(TF) with the term discriminability (TD).

Mail filters in popular mail programs like Eudora
have always been able to filter messages based on the
presence of specific keywords in the message body. One
could, for example, establish a Eudora filter that
automatically deletes any message containing the word "sex".
In fact, we use this capability for processing the mail that
a plugin implementing this invention classifies as junk.
The plugin adds a unique keyword to the message to indicate
that it is junk, and the user can set up a Eudora filter
that redirects the message to a special mailbox, deletes it,
or takes some other action on the message. The present
invention is more powerful than the simple Boolean keyword
search in that it uses an extended vocabulary, with or
without term weights, to distinguish junk messages from
non-junk messages. With the Eudora filters, it is an
all-or-nothing affair. If the keyword is present, the
message is classified as junk. If the keyword is not
present, the message slips through the filter. The present
invention measures the degree to which a message should be
classified as junk. There are many words, like "money",
which are ambiguous as to whether the message is junk or
not. The present invention counts the frequency of
occurrence of such terms, along with other common warning
signs of junk messages, to provide a quantitative measure of
whether a message is junk or not.

Although TFIDF, Naive Bayes, and similar methods
have been used for filtering e-mail (see, for example, Jason
Rennie's ifile system), they suffer from a sparse data
problem. It is very hard for document similarity metrics
like TFIDF and Naive Bayes to classify documents when they
have very few exemplars of the class. Such metrics need
large quantities of data in order to work.

We address the sparse data problem by establishing
a large, well-formulated query in advance by training on a
large corpus of junk messages. Not only does this allow us
to accurately identify junk messages without relying on the
user to compile and maintain their own corpus of junk
messages, but it works immediately, right out of the box.
The idea of preparing a well-formulated query for a specific
filtering task in advance represents an improvement to the
state of the art. It is not possible to do this for the
user's own classification system, in general, but for a
sharply focused and important problem like eliminating junk
messages, it is easy and effective.

SUMMARY OF THE INVENTION

Briefly, according to this invention, there is
provided a computer implemented method of filtering junk
messages by analyzing the content of the message instead of
or in addition to using the message headers. This method
involves document classification using a variety of
information retrieval methods, but with unusually large
queries. The term "queries", as used herein, refers to
searches for terms in messages (or other documents) that
match a list of terms (or lexicon). In this invention, a
list of terms may include multiple word n-grams. The
present invention uses very large queries (on the order of
250, 500 or 1,000 query terms or more in the lexicon) to
achieve extremely high accuracy in classifying documents.
The key is to pick topics for which a large set of exemplars
is available so that the large queries can be constructed.
Besides using the invention to filter junk messages, other
possible applications include identifying job announcements,
categorizing classified advertisements (e.g., "for sale"
versus "wanted", real estate, automobiles and so on),
rating appropriateness for children and other well-defined
categories. The present invention may also be used to
classify web pages and newsgroup postings in addition to
e-mail. Since the categories are static but are of
widespread interest, the time invested in constructing large
queries will be worthwhile and can be invested by the
software manufacturer instead of the end-user.

Junk mail, for example, is filtered by computing
the sum of the product of the frequency of occurrence with
the term weight for every term from the term lexicon that
also appears in the message. The resulting sum is
normalized by dividing the result by the total number of
words (or the number of unique words) in the document. In
other words, it is the dot product of the term frequency
vector with the term weight vector, perhaps normalized by
document length. The key to the accuracy of this method is
a large lexicon. This method permits alternate desired term
weighting schemes.
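The dot-product scoring described above can be sketched in a
few lines of Python. This is a minimal illustration only,
not the implementation of the invention: the function name
weighted_score, the sample lexicon and its weights are all
hypothetical, and only single-word terms are handled.

```python
from collections import Counter

def weighted_score(words, lexicon, normalize=True):
    """Sum of (term frequency * term weight) over lexicon terms.

    `words` is the tokenized message; `lexicon` maps each term to
    its weight. Both are assumptions made for this sketch.
    """
    counts = Counter(words)  # term frequency vector
    total = sum(counts[term] * weight for term, weight in lexicon.items())
    if normalize and words:
        total /= len(words)  # normalize by document length
    return total

# Hypothetical lexicon and message, for illustration only.
lexicon = {"money": 0.9, "free": 0.5, "report": 0.7}
words = "make money fast with free money".split()
score = weighted_score(words, lexicon)
```

Setting every weight to 1 reduces this to a simple count of
lexicon hits per word.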

According to a preferred method, the document or
message is broken up into equal size chunks of the same
number of words, with the score for the document taken as
the maximum score for any chunk in this document. The last,
odd-sized chunk may be merged into the previous chunk.
Typical chunk sizes may be 50, 100 and 200 words.
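The chunking step can be illustrated as follows. This is a
sketch under stated assumptions: the helper names
split_into_chunks and document_score are hypothetical, the
short final chunk is always merged into the previous one,
and the per-chunk scoring function is supplied by the caller.

```python
def split_into_chunks(words, chunk_size):
    """Split a word list into chunks of chunk_size words each."""
    chunks = [words[i:i + chunk_size]
              for i in range(0, len(words), chunk_size)]
    # Merge a short trailing chunk into the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail
    return chunks

def document_score(words, chunk_size, score_chunk):
    """Score of the document = maximum score over its chunks."""
    chunks = split_into_chunks(words, chunk_size)
    return max((score_chunk(c) for c in chunks), default=0.0)

# Toy per-chunk scorer (an assumption): fraction of words equal to "x".
def density(chunk):
    return chunk.count("x") / len(chunk)

result = document_score(["x", "x", "a", "a", "a", "a"], 2, density)
```

Taking the maximum over chunks means that a short burst of
junk-like language inside a long message still produces a
high score.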

According to one embodiment, the term weights are
uniformly set equal to 1. According to another embodiment,
a term's weight is its classification accuracy, as measured
in a training corpus. Classification accuracy is the
probability that the message is Junk given the Term is found
in the message, that is, P(Junk | Term). The term weights
are adjusted to occur above a minimum term weight (e.g.,
.1%) so that terms which are not present in the training
corpus have non-zero term weights. In yet another
embodiment, the term weights are the information gain,
log(P(Term | Junk)). This embodiment makes use of the Naive
Bayes method, but modified to allow the use of word n-grams
(bigrams, trigrams, etc.) in addition to word unigrams.

A novel method disclosed herein uses word n-gram
statistics (including unigram, bigram, trigram and mixed-
length n-grams) on message content to identify junk
messages. Another novel method disclosed herein involves
using a product of term weights with term frequencies.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention uses a content-based method
to identify the likelihood of a message being a junk message
based on the content of the message itself. The language
used in junk messages has characteristics that make it
detectable. These methods offer a much higher accuracy than
the prior art in correctly classifying messages as either
junk or non-junk. The present invention has an accuracy
that surpasses the effectiveness of header-based methods and
is of sufficient accuracy to be used in stand-alone fashion
to filter junk messages. However, there is no reason why it
cannot be combined with header-based methods, and it is
expected that this combination will be able to stop
virtually all junk messages. Because the method is based on
the content of the message, with a rather fine-grained
filter, the junk messages cannot be easily modified to
bypass the filter.

The present invention automatically identifies
whether a message, such as a piece of e-mail or newsgroup
posting, is junk; marks it as junk; and either automatically
discards the message or automatically files it in a junk
mail folder (directory or subdirectory) for later review and
disposition by the user (with the name of the folder
designated either by the program or by the user).
The present invention includes a user-settable
threshold that determines whether a message is classified as
junk or not. If the message's bogosity score is above the
threshold, it is classified as junk. Otherwise, it is
classified as non-junk. The user can set the threshold
lower to let no junk through but occasionally misclassify
real messages as junk. The user can set the threshold
higher to catch most, but not all, of the junk messages
while not misclassifying any of the real mail, or the user
can set the threshold somewhere between the two thresholds.
This threshold may be set automatically to the
value necessary to maximize the overall accuracy in
classifying messages as junk or non-junk. Given a
collection of messages classified correctly and a set of
misclassified messages, it is a straightforward process to
find the threshold value that minimizes the number of
classification errors. Since the number of junk messages
classified as junk decreases as the threshold increases and
the number of real messages classified as junk increases as
the threshold decreases, there is a threshold value that
minimizes the number of classification errors. Common
search methods, like hill-climbing and binary search, can be
used to find it. This is similar to the methods we
described for adjusting the term weights in the lexicon, but
applies to the threshold value instead of the lexicon
weights.
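As an illustration of choosing the threshold from labeled
examples, the sketch below scans every observed score as a
candidate threshold and keeps the one with the fewest errors.
It deliberately swaps in an exhaustive scan for the
hill-climbing or binary search mentioned above, because a
scan is the simplest thing that is obviously correct on a
small sample. The function name best_threshold and the
sample scores are hypothetical.

```python
def best_threshold(junk_scores, real_scores):
    """Return (threshold, errors) minimizing classification errors.

    A message is called junk when its score is above the threshold.
    """
    # Also try a threshold below every score (everything is junk).
    candidates = [float("-inf")] + sorted(set(junk_scores) | set(real_scores))
    best_t, best_errors = None, None
    for t in candidates:
        missed_junk = sum(1 for s in junk_scores if s <= t)
        false_alarms = sum(1 for s in real_scores if s > t)
        errors = missed_junk + false_alarms
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

# Hypothetical bogosity scores for labeled messages.
junk_scores = [0.30, 0.25, 0.22, 0.10]
real_scores = [0.05, 0.08, 0.12, 0.21]
threshold, errors = best_threshold(junk_scores, real_scores)
```

A user who cares more about never losing real mail than
about catching all junk could weight false_alarms more
heavily than missed_junk in the error count.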

There are many phrases which are quite common in
junk messages but significantly less common in legitimate
correspondence. Examples include "credit card", "please
pardon the intrusion", "make money fast", "extremely
lucrative opportunity", "dear adult webmaster", "completely
legal", "opportunity of a lifetime", "check or money order",
"credit repair", "very lucrative", "limited time offer", and
"to be removed". A lexicon of such phrases may be compiled
through a combination of automated methods and human
judgment.

In one embodiment of this invention, referred to
herein as the "bogosity" method, one measures the degree to
which the content of the message relies on a restricted
lexicon of terms common in junk mail. This yields a "junk
density" or bogosity figure. The higher this figure, the
greater the degree to which the message uses the telltale
signs of junk, and hence the greater the likelihood that the
message is junk. Given a junk density threshold, the system
can classify as junk any message with a bogosity score above
the threshold.

The bogosity method breaks up the messages into,
say, 100 word chunks, and counts the number of word n-grams
(multiple word phrases) in each chunk which also appear in
the lexicon of phrases that are indicative of junk messages.
The result is normalized by dividing it by the number of
words in the chunk. The default chunk size can be set by
the user. Typically, the chunk size will vary between 50
and 200 words. The bogosity score of the chunk with the
highest bogosity score is used as the overall bogosity score
of the message. The last chunk in the message may be less
than the default chunk size. The bogosity method may ignore
this chunk or merge it in with the previous chunk depending
on the number of words in the chunk and the number of chunks
in the message.

According to another embodiment, referred to
herein as the TDTF method, weights are applied to each
lexicon entry according to the Term Discriminability
(classification accuracy) learned from a training corpus.
Lexicon entries that are more indicative of junk will have
higher weights than entries which are more ambiguous in
nature. Negative weights are also permitted, to allow the
lexicon to include negative examples (e.g., good indicators
of non-junk). This is the TDTF algorithm, where TD stands
for term discriminability and TF stands for term frequency.
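One way the term discriminability weights might be learned
is sketched below. It estimates P(Junk | Term) as the
fraction of training messages containing the term that are
junk, and applies a small minimum weight so that terms absent
from the training data do not end up with a weight of zero.
The function name, the representation of each message as a
set of terms, and the sample data are assumptions made for
this sketch; the 0.001 floor mirrors the ".1%" example given
earlier in the text.

```python
def term_discriminability(lexicon_terms, junk_msgs, real_msgs, floor=0.001):
    """Estimate P(Junk | Term) for each lexicon term.

    junk_msgs and real_msgs are lists of sets of terms, one set
    per training message (a representation assumed here).
    """
    weights = {}
    for term in lexicon_terms:
        in_junk = sum(1 for msg in junk_msgs if term in msg)
        in_real = sum(1 for msg in real_msgs if term in msg)
        seen = in_junk + in_real
        td = in_junk / seen if seen else 0.0
        weights[term] = max(td, floor)  # keep every weight non-zero
    return weights

# Hypothetical training data.
junk_msgs = [{"money", "free"}, {"money"}]
real_msgs = [{"money", "report"}]
weights = term_discriminability(
    ["money", "free", "report", "lucrative"], junk_msgs, real_msgs)
```

Note that this estimate depends on the proportion of junk to
non-junk in the training corpus, so the corpus should be
reasonably representative.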

A variation on the embodiments described uses a
library of example junk messages in case-based fashion. The
idea is to use the exemplar messages as lexicons and to use
an algorithm like bogosity to measure the similarity between
the incoming e-mail and each of the messages in the library.
If the similarity score for any junk message in the library
with the incoming message exceeds a threshold, the incoming
message would be classified as junk. This is similar in
implementation to the bogosity method, although somewhat
different in conception, with the difference deriving from
the use of the exemplar messages themselves as the lexicons
and the use of many smaller lexicons (corresponding to each
of the exemplar messages) instead of one large lexicon.
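The case-based variation can be sketched as follows, treating
the vocabulary of each exemplar message as its own small
lexicon. The similarity measure used here (the fraction of
incoming words found in the exemplar) is one plausible
bogosity-like choice, not necessarily the one used in
practice; the function names and sample library are
hypothetical.

```python
def similarity(incoming_words, exemplar_words):
    """Fraction of incoming words that occur in the exemplar."""
    if not incoming_words:
        return 0.0
    vocab = set(exemplar_words)  # the exemplar acts as the lexicon
    hits = sum(1 for w in incoming_words if w in vocab)
    return hits / len(incoming_words)

def is_junk_by_example(incoming_words, library, threshold):
    """Junk if any exemplar in the library is similar enough."""
    return any(similarity(incoming_words, ex) > threshold
               for ex in library)

# Hypothetical library of junk exemplars.
library = [["make", "money", "fast"], ["credit", "repair", "now"]]
flagged = is_junk_by_example("make money fast today".split(), library, 0.5)
passed = is_junk_by_example("see you at lunch".split(), library, 0.5)
```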

According to yet another embodiment of this
invention, use is made of the Naive Bayes statistical method
that measures the information gain of classifying the
messages using each word from the training corpus and
computes the overall likelihood of each message. For
example, the top 20 words in the junk class sorted by log
likelihood values are: money, report, business, order,
orders, mail, e-mail, receive, free, send, credit, bulk,
marketing, internet, program, cash, service, people,
opportunity and product. This matches our intuitions about
what terms are good indicators of junk messages. The
benefits of Naive Bayes are that it is a statistically well-
founded technique which weights according to likelihood and
incorporates notions of positive and negative weights by
using separate scores for junk and non-junk and comparing
the two.

A problem with Naive Bayes is the assumption that
words occur independently. For example, the word "report"
may be a good indicator of junk mail (many pyramid schemes
use this word), but relying on it alone also filters out
messages about progress reports. This problem is remedied
by gathering statistics on word n-grams (e.g., word bigrams
and trigrams) in addition to single words.
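The effect of adding bigram statistics can be seen in a small
sketch. It trains term counts on unigrams plus adjacent-word
bigrams for each class and compares the two log likelihoods.
Several choices here are assumptions made for illustration
and are not taken from the text: add-one smoothing, equal
class priors, the function names, and the toy training
messages.

```python
import math
from collections import Counter

def terms(words):
    """Unigrams plus adjacent-word bigrams."""
    return list(words) + [f"{a} {b}" for a, b in zip(words, words[1:])]

def train(messages):
    """Count every term occurring in a list of tokenized messages."""
    counts = Counter()
    for words in messages:
        counts.update(terms(words))
    return counts

def log_likelihood(words, counts, vocab_size):
    total = sum(counts.values())
    # Add-one smoothing so unseen terms do not produce log(0).
    return sum(math.log((counts[t] + 1) / (total + vocab_size))
               for t in terms(words))

def classify(words, junk_counts, real_counts):
    vocab_size = len(set(junk_counts) | set(real_counts))
    junk = log_likelihood(words, junk_counts, vocab_size)
    real = log_likelihood(words, real_counts, vocab_size)
    return "junk" if junk > real else "non-junk"

# Toy training data: "report" occurs in both classes.
junk_counts = train([["free", "credit", "report"],
                     ["order", "credit", "report", "now"]])
real_counts = train([["progress", "report", "attached"],
                     ["weekly", "progress", "report"]])
```

With this toy data the word "report" is equally common in
both classes, so the bigrams "credit report" and "progress
report" are what separate the two.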

At a basic level, the bogosity, TDTF, and Naive
Bayes methods are similar in implementation. They each
maintain a lexicon of terms (single words, word bigrams,
word trigrams and word n-grams in general, as well as word
n-grams with stop words removed) with weights associated
with each term. For bogosity the weight is set equal to 1.
For TDTF the weight is the trained classification accuracy
(term discriminability) of the term, which is equivalent to
the probability that the message is junk given the term,
     P(Junk | Term).
For Naive Bayes, the weight is the information
gain, which is the logarithm of the probability of the term,
given that the message is junk,
     log(P(Term | Junk)).

Given these weights, the score for a document (or
a chunk of a document) is the dot product (the sum of
products, a linear combination of products) of the term
frequencies with the corresponding term weights, perhaps
normalized by document length.

WO 00/26795     PCT/US99/24359

Various methods have been used on a corpus of junk
and non-junk messages, computing the accuracy in classifying
junk and non-junk, as well as the overall classification
accuracy. It is important not only that the method identify
junk, but also that it not mistakenly identify non-junk as
junk. Those skilled in the art can quickly write a program
for scanning the corpus of junk documents to develop the
weights for terms found in the documents.

When the TDTF algorithm's weights are trained
using different data than was used to construct the lexicon,
some lexicon terms might not appear in the training data.
This can happen when human judgment is used to add simple
variations to the lexicon terms (e.g., adding a new term
that corrects a spelling error in a lexicon term). The new
term will not necessarily occur in the training data and so
might be assigned a score of 0. It is important to
adjust the scores so that this term has a small non-zero
value.

As noted previously, the junk accuracy of the
heuristic (user not listed as a recipient) was about 50%,
and the junk accuracy of blacklists was about 70%. The
bogosity embodiment with a 0.20 threshold had a junk
classification accuracy of about 90%, a non-junk
classification accuracy of about 96% and an overall
classification accuracy of about 95%. (Raising the
threshold reduces the junk classification accuracy while
increasing the non-junk classification accuracy. The 0.25
threshold seemed like a reasonable compromise.) The TDTF
method with a threshold of 0.20 had junk, non-junk and
overall classification accuracy scores of about 91%, 96% and
95%. Increasing the threshold to 0.25 reduced the junk
accuracy to about 81% but increased the non-junk
classification accuracy to 98%, with an overall accuracy of
about 97%. The method using Naive Bayes with unigrams had
a junk classification accuracy of about 97%, non-junk about
96% and overall 96%. The method using Naive Bayes with
bigrams had a junk classification accuracy of about 98%, a
real classification accuracy of about 98% and an overall
classification accuracy of about 98%. Thus, the present
invention represents a significant improvement to the state
of the art.

Alternate implementations would involve several
variations on the theme. For example, one implementation
would train the lexicon on the user's own e-mail when the
user installed the program. Another implementation would
provide a ready-made lexicon and weights, and would allow
the user to add new terms to the lexicon, delete terms from
the lexicon and manually adjust the weights. Yet another
implementation would also automatically adjust the weights
when presented with new examples of junk and non-junk by
small increments (for positive examples) and small
decrements (for negative examples) for the terms found in
the example. The increments and decrements would be
computed using a variety of methods, such as gradient
descent.
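The incremental adjustment of weights from new examples might
look like the sketch below. It uses a fixed step size rather
than gradient descent, purely to keep the illustration short;
the function name adjust_weights, the step of 0.01 and the
minimum weight of 0.001 are all assumptions.

```python
def adjust_weights(weights, example_terms, is_junk, step=0.01, floor=0.001):
    """Return a copy of weights nudged by one labeled example.

    Terms found in a junk example are incremented by `step`;
    terms found in a non-junk example are decremented by `step`.
    """
    updated = dict(weights)  # leave the caller's lexicon untouched
    delta = step if is_junk else -step
    for term in set(example_terms):
        if term in updated:
            updated[term] = max(floor, updated[term] + delta)
    return updated

# Hypothetical lexicon weights and examples.
weights = {"money": 0.5, "report": 0.005}
after_junk = adjust_weights(weights, ["money", "hello"], is_junk=True)
after_real = adjust_weights(weights, ["report"], is_junk=False)
```

An embodiment that permits negative weights would drop the
floor, or replace it with a lower bound below zero.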

Prototypes of each of these methods have been
implemented in Perl and C. They have been found to be quite
useful in practice with Unix mail. The invention has also
been implemented as a plugin for the popular Windows and
Macintosh mail program Eudora. The latest version also
includes adjustable thresholds, whitelists and blacklists,
and can highlight significant keywords in the e-mail
message.

A copy of the PERL source code for a stand-alone
version of bogosity and part of its lexicon follow. For an
explanation of the PERL language, reference is made to
Learning Perl, Second Edition, by Randal L. Schwartz and Tom
Christiansen (O'Reilly & Associates, Inc. 1997).

SOURCE CODE FOR BOGOSITY.PL

# directory holding the mail file and the phrase file
$rootdir = "C:\\usr\\mkant\\Bogosity\\";
$mailfile = $ARGV[0];
$mailfile = "mail.txt" if (!$mailfile);
# the file of bogus words and phrases
$phrasefile = "bogosity.txt";
# number of words per chunk
$chunksize = $ARGV[1];
$chunksize = 200 if (!$chunksize);
# Let -ly and -est contribute to bogosity
$lyest = 1;
# For counting ! and ?
$maxrictus = 0;
$rictus = 0;

# Load the phrase file.
open(PHRASE, "$rootdir$phrasefile") || die "Cannot open $phrasefile: $!";
foreach $phrase (<PHRASE>) {
    chomp $phrase;
    $phrases{"$phrase"} = 1;
}
close(PHRASE);

# process the mail
$maxbogosity = 0;
$wordcount = 0;
$bogosity = 0;
# the four most recent words, used to build n-grams
$prev = "";
$pprev = "";
$ppprev = "";
$pppprev = "";
open(MAIL, "$rootdir$mailfile") || die "Cannot open $mailfile: $!";
foreach $line (<MAIL>) {
    chomp $line;
    # skip header lines
    if ($line !~ /^From:|^Newsgroups:|^Subject:|^Date:|^Organization:
                  |^Lines:|^Message-ID:|^References:|^Mime-Version:
                  |^X-.*:|^NNTP-Posting-Host:|^Path:|^Content-.*:/x) {
        @lwords = split(/\s+/, $line);
        foreach $word (@lwords) {
            # normalize: strip possessive and edge punctuation, lowercase
            $lword = $word;
            $lword =~ s/\'s$//;
            $lword =~ s/\W$|\.$//;
            $lword =~ s/^\W|^\.//;
            $lword =~ tr/A-Z/a-z/;
            # ignore long tokens, addresses, newsgroup names, URLs
            # and tokens made up only of digits and punctuation
            # (the address test was: ca,com,de,edu,gov,mil,net,org,uk,us)
            # (one further test on $word is illegible in the source)
            if (length("$word") < 25 &&
                $word !~ /.+\@.+/i &&
                $word !~ /rec\./i &&
                $word !~ /comp\./i &&
                $word !~ /soc\./i &&
                $word !~ /sci\./i &&
                $word !~ /\.forsale/i &&
                $word !~ /\.general/i &&
                $word !~ /misc\./i &&
                $word !~ /alt\./i &&
                $word !~ /news\./i &&
                $word !~ /<URL:/i &&
                $word !~ /http:\/\//i &&
                $word !~ /ftp:\/\//i &&
                $word !~ /name\s*=/i &&
                $word !~ /href\s*=/i &&
                $word !~ /gopher:\/\//i &&
                $lword !~ /^[\d\-<>,\^\$]*$/ &&
                $lword ne "") {
                $wordcount++;
                # single word, or an -ly / -est word
                if ($phrases{"$lword"} == 1) {
                    $bogosity++;
                } elsif ($lyest == 1 &&
                         ($lword =~ /ly$|est$/)) {
                    $bogosity++;
                }
                # two- to five-word phrases ending at this word
                if ($phrases{"$prev $lword"} == 1) {
                    $bogosity++;
                }
                if ($phrases{"$pprev $prev $lword"} == 1) {
                    $bogosity++;
                }
                if ($phrases{"$ppprev $pprev $prev $lword"} == 1) {
                    $bogosity++;
                }
                if ($phrases{"$pppprev $ppprev $pprev $prev $lword"} == 1) {
                    $bogosity++;
                }
                if ($word =~ /\?$|\!$/) {
                    $rictus++;
                }
                # end of chunk: update the running maxima and reset
                if ($wordcount >= $chunksize) {
                    if ($bogosity > $maxbogosity) {
                        $maxbogosity = $bogosity;
                    }
                    if ($rictus > $maxrictus) {
                        $maxrictus = $rictus;
                    }
                    $wordcount = 0;
                    $bogosity = 0;
                    $rictus = 0;
                }
                $pppprev = $ppprev;
                $ppprev = $pprev;
                $pprev = $prev;
                $prev = $lword;
            }
        }
    }
}
close(MAIL);

printf "Maximum Bogosity: %.3f ($maxbogosity/$chunksize)\n",
    $maxbogosity/$chunksize;
printf "Maximum Chunk Rictus (!?): %.3f ($maxrictus/$chunksize)\n",
    $maxrictus/$chunksize;

BOGOSITY.TXT (partial)

!
!!
$
$$
$$$
$$$$
$$$$$
$$$$$$
$$$$$$$
$2.7 billion
$50,000
$50,000 dollars or more
$6.6 billion
$70,000
*this* mailing list
1,000
1,000,000
10,000
100%
100% committed
100% legal
100% of the time
100% satisfied
100,000
1000%
1302
1342
18 years old
1st level
1st time
200%
2nd level
3 level
300%
3rd level
4 level
400%
4th level
500%
5th level
8 level
90-day limited warranty
Four-level
a brand new social security number
a copy of
a couple of
a credit card
a deep breath
a different report
a few
a few hours
a few minutes
a large amount of money
a leading
a letter
a limited number
a list
a little bit
a little time
a lot
a lot easier
a lot more
a lot of
a lot of money
a lot of time
a mail box
a mailbox
a mailing list
a mailing list company
a mailing of
a miracle
a month
a must
a sign
a significant advantage
a sound way
a special program
a testimonial
a ton of money
a top leader
a total of
a total of perhaps
a variety of
ability
about to make
absolutely
absolutely convinced
absolutely free
absolutely guarantee
absolutely guaranteed
absolutely no credit check
absolutely no other fees
absolutely no risk
absolutely nothing
abuse
accept all credit cards
accept all major credit cards
accept american express
accept amex
accept cash
accept check
accept checks
accept credit cards
accept creditcards
accept major credit cards
accept master
accept mastercard
accept money orders
accept payment
accept personal checks
accept visa
accept visa/master
access fees
account executive
account number
account representative
acquiring e-mail lists
acquiring email lists
act fast
act now
action
activity level
ad
ad banner
ad below
ad campaign
ad length system
added bonus
additional income
address
address city
addressed
addresses
addresses accurately

The program flow can generally be described as
follows. The lexicon file containing the words and phrases
characteristic of junk mail, "bogosity.txt", and the file
containing the mail, "mail.txt", are opened. A word is
input from the mail.txt file and compared to the lexicon.
If a match is found, the score for that word (in this case
always the same) is added to the raw score. The first word
is kept so that it, along with the next word, can be compared
to double-word phrases in the lexicon. Words and phrases
(in this case up to five-word phrases) are compared to the
lexicon and scored. When the maximum chunk size has been
read and compared to the lexicon, the total score is divided
by the chunk size. The next chunk is then analyzed. A
running maximum score for the chunks of the message is kept
and used as the score for the message. If the last chunk is
too short, it is merged with the next-to-last chunk or
discarded. Finally, a line of text is added to the message
to tag it as junk or not. Most mail programs have the
capability of filing or discarding messages based upon this
added line of text. This program is easily modified to
implement the TDTF method and the Naive Bayes methods. The
only difference is the use of different weights for terms in
the lexicon.
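
The flow described above can be illustrated with a short Python
sketch. It is not the program referred to in the text; it is a
minimal reconstruction under stated assumptions. The tokenizer, the
default chunk size of 100 words, the minimum tail length of 20
words, the tag line "X-Junk-Filter" and all function names are
hypothetical choices made for illustration. Phrases are matched only
within a chunk, and a merged final chunk is divided by its actual
length.

```python
from typing import Dict, List

MAX_NGRAM = 5  # the description compares phrases of up to five words


def tokenize(text: str) -> List[str]:
    """Assumed tokenizer: lower-case, split on whitespace, trim punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def split_chunks(words: List[str], chunk_size: int, min_last: int) -> List[List[str]]:
    """Cut the message into fixed-size chunks. A too-short final chunk is
    merged into the next-to-last one (the text also allows discarding it)."""
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_last:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail
    return chunks


def score_chunk(chunk: List[str], weights: Dict[str, float]) -> float:
    """Add the weight of every 1- to 5-word phrase found in the lexicon,
    then divide the raw score by the number of words in the chunk."""
    raw = 0.0
    for start in range(len(chunk)):
        for n in range(1, MAX_NGRAM + 1):
            if start + n > len(chunk):
                break
            raw += weights.get(" ".join(chunk[start:start + n]), 0.0)
    return raw / len(chunk) if chunk else 0.0


def score_message(text: str, weights: Dict[str, float],
                  chunk_size: int = 100, min_last: int = 20) -> float:
    """The message score is the running maximum over its chunk scores."""
    chunks = split_chunks(tokenize(text), chunk_size, min_last)
    return max((score_chunk(c, weights) for c in chunks), default=0.0)


def tag_message(text: str, weights: Dict[str, float], threshold: float,
                chunk_size: int = 100, min_last: int = 20) -> str:
    """Prepend a one-line tag (hypothetical format) that a mail program
    could use to file or discard the message."""
    score = score_message(text, weights, chunk_size, min_last)
    label = "junk" if score >= threshold else "ok"
    return f"X-Junk-Filter: {label}\n{text}"


# Uniform weights, as in the described program; other weighting
# schemes would only change the values stored in this mapping.
lexicon = {phrase: 1.0 for phrase in ("act now", "absolutely free", "accept visa")}
tagged = tag_message("Act now, this offer is absolutely free!", lexicon, threshold=0.2)
```

Because the weights live in a plain mapping from phrase to value,
switching to a different weighting method requires no change to the
scoring loop, which matches the remark that only the term weights
differ between methods.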

Having thus defined our invention in the detail
and particularity required by the Patent Laws, what is
desired protected by Letters Patent is set forth in the
following claims.
WE CLAIM:

1. A computer implemented method for filtering
of junk messages comprising analyzing the content of the
messages.

2. A computer implemented method for
classification of a document as a junk message comprising
analyzing the content of documents for the presence or
absence of more than 250 words and/or multiple word n-grams.

3. A computer implemented method for
classification of a document as a junk message comprising 
the steps of: 

a) computing the sum of the product of the
frequency of occurrence with an assigned term weight for
every term and/or multiple word n-gram from a term lexicon
that also appears in a document; and

b) assigning a score to the document based on the 
resulting sum. 

4. The method according to claim 3, comprising 
the step of normalizing the resulting sum by dividing the 
result by the total number of words (or the number of unique
words) in the document.

5. The method according to claim 3, wherein the
document is broken up into equal sized chunks of the same 
number of words, with the score for the document as the 
maximum score for any chunk in the message. 

6. The method according to claim 3, 4 or 5, 
comprising the further step of comparing the score assigned 
to the document to an adjustable threshold and classifying 
the document on the basis of that comparison. 
7. The method according to claim 3, 4 or 5,
wherein the term weights are uniformly set equal to 1.

8. The method according to claim 3, 4 or 5, 
wherein a term's weight is its classification accuracy
P(Junk | Term), as measured in a training corpus.

9. The method according to claim 3, 4 or 5, 
wherein the term weights are the information gain,
log(P(Term | Junk)), as measured in a training corpus.

10. The method according to claim 3, 4 or 5,
wherein the term weights are supplied by the dependency tree
algorithm.

11. The method according to claim 3, 4 or 5, with 
any monotonic modification of the weights.

12. The method according to claim 3, 4 or 5,
wherein the lexicon is comprised of a plurality of lexicons
and a score is assigned to the document based upon the maximum
score using any one of the plurality of lexicons.

13. The method according to claim 12, wherein the
plurality of lexicons includes one or more junk messages.

14. The method according to claim 3, 4 or 5 
applied to e-mail documents or the like, wherein message 
headers are compared with a blacklist to block messages that 
match header-based constraints. 

15. The method according to claim 3, 4 or 5 
applied to e-mail documents or the like, wherein message
headers are compared with a whitelist to pass through 
messages that match header-based constraints.
16. The method according to claim 14, wherein
only documents that are not blocked by the blacklist 
constraint are classified. 

17. The method according to claim 15, wherein
only documents that pass the whitelist constraint are 
classified. 

18. The method according to claim 6 applied to
e-mail documents or the like, wherein the user can set the
threshold to let no junk mail through but occasionally
misclassify a non-junk message as junk, or the user can set
the threshold to block most, but not all, junk messages
while not misclassifying any non-junk messages, or the
threshold can be set somewhere therebetween.

19. The method according to claim 1, comprising
a step for assigning a score to the document based on the
content thereof which rises and falls with the likelihood
that the message is junk and a step for comparing the score
to a threshold to determine whether the message should be
classified as junk.

20. The method according to claim 19, comprising 
the step for adjusting the threshold to control the balance 
between identifying junk messages and misclassifying non- 
junk messages. 

21. The method according to claim 19 comprising 
a step for automatically setting the threshold to minimize 
all classification errors. 

22. The method according to claim 3, 4 or 5,
wherein the lexicon is derived from a training set of
documents.
23. The method according to claim 22, wherein
every term and/or multi-word n-gram in the lexicon has at
least a minimum value. 